Abstract — Deploying femtocells is an efficient way to improve coverage
and increase capacity in cellular networks. As cellular networks become more
complex, energy consumption of whole network infrastructure 
is becoming important in terms of both operational costs and 
environmental impacts. This paper investigates energy efficiency 
of two-tier femtocell networks through combining game theory 
and stochastic learning. With the Stackelberg game formulation, 
a hierarchical reinforcement learning framework is applied for 
studying the joint expected utility maximization of macrocells 
and femtocells subject to the minimum signal-to-interference- 
plus-noise-ratio requirements. In the learning procedure, the 
macrocells act as leaders and the femtocells are followers. At 
each time step, the leaders commit to dynamic strategies based 
on the best responses of the followers, while the followers 
compete against each other with no further information but the 
leaders' transmission parameters. In this paper, we propose two 
reinforcement learning based intelligent algorithms to schedule 
each cell's stochastic power levels. Numerical experiments are 
presented to validate the investigations. The results show that 
the two learning algorithms substantially improve the energy 
efficiency of the femtocell networks. 

Index Terms — Stackelberg game, resource allocation, energy 
efficiency, femtocell, algorithm/protocol design and analysis, re- 
inforcement learning. 



I. Introduction 

The insatiable desire for higher data rates and the requirement
of ubiquitous internet access call for a denser deployment of
base stations within the network cells. Traditional network
infrastructures are inefficient in this respect, yet it may not be
economical for operators to make radical alterations to the
current network architectures. Cellular networks are generally
designed to provide large coverage and are not efficient in
satisfying the ever-increasing demand for capacity density.
Therefore, cellular network deployment solutions based on
femtocells are quite promising in this context [1]. Owing to the
short transmit-receive distance, femtocell techniques can greatly
improve the indoor experience of mobile users.

The escalation of energy consumption in wireless communications
directly leads to the growth of greenhouse gas emission, which has
been recognized as a major threat to environmental protection and
sustainable development. Today, increasingly rigid environmental
standards have created an urgent need for green evolution in
wireless communication networks [2], [3]. In wireless cellular
networks, the radio access section is the main source of energy
consumption, accounting for more than 70% of the total energy
consumption.

To meet the challenges raised by the exponential growth
in mobile services and energy consumption, it is crucial to
increase the energy efficiency of wireless cellular networks.
This paper addresses the energy efficiency problem in femtocell
networks. The problem of energy-efficient spectrum sharing and
power allocation in cognitive radio femtocells was studied in [4],
where a three-stage Stackelberg game model was formulated to
improve the energy efficiency. In [5], Ashraf et al. proposed a
novel energy saving procedure for the femtocell base station (FBS)
to decide when to switch on/off. Hereinafter, we focus mainly on
the co-channel operation of femtocells with closed access, for the
following reasons: 1) privacy concerns; 2) limited backhaul
bandwidth; 3) no coordination between the macrocells and
femtocells on spectrum allocation; 4) high requirements on mobile
terminals.

On the other hand, in co-channel two-tier femtocell networks,
the cross-tier/co-tier interference greatly restricts the
overall network performance. Thus interference cancellation in
two-tier femtocell networks has become an active area of research.
For the uplink transmission in two-tier femtocell networks,
Chandrasekhar and Andrews [6] proposed a distributed utility-based
signal-to-interference-plus-noise ratio (SINR) adaptation
algorithm to alleviate the cross-tier interference at the
macrocell from the co-channel femtocells. A Stackelberg game was
formulated in [7] to study the resource allocation in two-tier
femtocell networks, where the macrocell base station (MBS)
protects itself by pricing the interference from femtocell users
(FUs). In [8], Jo et al. developed two interference mitigation
strategies that adjust the maximum transmission power of FUs to
control the cross-tier interference at an MBS. Regarding the
downlink transmissions, Guruacharya et al. modeled the power
allocation problem as a Stackelberg game to maximize the capacity
of each station [9]. In addition, a macrocell beam subset selection
strategy was used to reduce the cross-tier interference in
two-tier femtocell networks in [10].

The unplanned deployment of femtocells results in unpredictable
interference patterns. The interference in this scenario cannot be
handled by means of centralized network scheduling, because the
number and locations of femtocells are unknown. In such a
networking environment, the femtocells are most likely to be
autonomous, which motivates using reinforcement learning (RL) [11]
for interference management. A real-time multi-agent RL algorithm
that optimizes the network performance by managing the interference
in femtocell networks was investigated in [12]. Bennis et al. [13]
developed a distributed learning scheme based on Q-learning to
manage the femto-to-macrocell cross-tier interference in femtocell
networks. Inspired by evolutionary game theory and machine
learning, Nazir et al. [14] proposed two intelligent mechanisms
for interference mitigation to support the coexistence of
macrocell and femtocells.

In this paper, we model the energy efficiency aspect of the
power allocation problem in femtocell networks as a Stackelberg
learning game, i.e., a leader-follower learning process, with
the following characteristics: 1) the macrocells are considered
to be the leaders, whereas the femtocells are considered to be
the followers; 2) the leaders act knowing the response
of the femtocells to their own strategy decisions; 3) given
the leaders' decisions, the followers compete with each other.
Learning is accomplished by directly interacting with the
surrounding environment and properly adjusting the strategies
according to the realizations of achieved performance. The
solution of such a learning game is the Stackelberg equilibrium
(SE). If no hierarchy¹ exists during the learning procedure,
the Stackelberg learning game reduces to the non-cooperative
learning game, which is the scenario discussed in [11]. Energy
efficiency in wireless networks was studied using Stackelberg
games in [15], [16].

Compared to the previous works, the main contributions of 
this paper are summarized as follows: 

• Firstly, for the energy efficiency problem in the femtocell
networks, we propose a Stackelberg learning game for all
users to jointly learn the optimal transmission strategies.

• Secondly, we develop a reinforcement learning based
hierarchical power adaptation algorithm (RLHPA-I) where
the learning rule for FUs is based on each FU's private
and incomplete information, and the MU acts as the
leader and learns the optimal transmission configuration
by obtaining all FUs' strategy information; the
trajectory of the learning dynamics is also investigated.

• Thirdly, in order to encourage potential cooperation
among the FUs, a second reinforcement learning based
hierarchical power adaptation algorithm (RLHPA-II) is
further proposed, where the FUs learn the optimal
transmission strategies through conjectural beliefs over other
competing FUs' stochastic behaviors; the convergence of
the learning procedure is proved theoretically.

The rest of this paper is organized as follows. The next
section presents the energy efficiency problem in femtocell
networks and defines a Stackelberg game theoretic solution for
the users' hierarchical behaviors. In Section III, a Stackelberg
learning framework is proposed and the existence of SE is also
investigated. Two reinforcement learning based algorithms are
derived in Section IV and Section V. The numerical results are
included in Section VI, verifying the validity and efficiency of
the proposed algorithms. Finally, we present in Section VII a
conclusion of this paper.

¹ In this paper, the hierarchy means that the knowledge levels of the users
are asymmetric.

Fig. 1. A typical femtocell network deployment (MBS: macrocell base
station; MU: macrocell user; FBS: femtocell base station; FU: femtocell user).

II. Problem Formulation 

In this section, we first present the Stackelberg game formu- 
lation for the energy efficiency problem in femtocell networks. 
After that, the Stackelberg equilibrium of the proposed game 
is investigated. 

A. Stackelberg Game Formulation 

The femtocell network scenario considered in this paper
is illustrated in Fig. 1, where there exist multiple femtocells
and macrocells. Each macrocell, consisting of an MBS and
multiple macrocell users (MUs), is underlaid with several
co-channel FBSs. In each femtocell, there is one FBS providing
service to femtocell users (FUs). Here we assume the closed
independent policy [17], since private customers may prefer
that kind of policy because of privacy concerns and limited
backhaul bandwidth. Assuming an identical distribution of
femtocells in various neighboring macrocells, we focus, without
loss of generality, on the case of one representative macrocell.
Suppose $N$ femtocells $B_i$ ($i \geq 1$) operate within the coverage
of a macrocell $B_0$. Users of the same macrocell/femtocell adopt
time division multiple access (TDMA) for data transmission, thus
causing no interference within the same macrocell or femtocell.
In the following, this paper mainly addresses the uplink
transmissions for the distributed femtocells and the underlaid
macrocell sharing a common spectrum band.

Let $i \in \mathcal{N}$ denote the scheduled user connected to its BS
$B_i$, where $\mathcal{N} = \{0, 1, \ldots, N\}$ refers to the index set of the
MU and various FUs belonging to the same coexisting cellular
region. Designate the transmission power level of user $i$ as
$p_i$ ($p_i^{\min} \leq p_i \leq p_i^{\max}$). The SINR $\gamma_i$ of user $i$ received at $B_i$
is given by

$$\gamma_i^* \leq \gamma_i(p_i, \mathbf{p}_{-i}) = \frac{h_{i,i}\, p_i}{\sum_{j \in \mathcal{N} \setminus \{i\}} h_{j,i}\, p_j + \sigma^2} \qquad (1)$$

where $\gamma_i^*$ represents the minimum SINR requirement, $\sigma^2$ is
the variance of background Additive White Gaussian Noise
(AWGN), $\{h_{j,i}\}$ is the set of channel gains from user $j$ to $B_i$,
and $\mathbf{p}_{-i}$ is a vector of power allocation for all users except user
$i$, i.e., $\mathbf{p}_{-i} = (p_0, \ldots, p_{i-1}, p_{i+1}, \ldots, p_N)$. In order to protect
the MU's transmissions, we propose that the macrocell sets
a power mask constraint for all femtocells [18], that is, the
transmission power level of user $i \in \mathcal{N} \setminus \{0\}$ over the shared
spectrum is constrained by

$$p_i \leq p_{\text{mask}}. \qquad (2)$$



A power mask prescribes the maximum transmission power 
that the femtocells may use over the spectrum. From a practical 
viewpoint, this is much easier for the operators to manipulate 
for the scenarios where the number of active femtocells varies 
in time and space. 

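In the system under investigation, each user $i \in \mathcal{N}$ is
selfish in the sense that it cares only about its own energy
efficiency, which can be expressed as [4], [19]

$$\eta_i(p_i, \mathbf{p}_{-i}) = \frac{W \log_2\left(1 + \gamma_i(p_i, \mathbf{p}_{-i})\right)}{p_a + p_i} \qquad (3)$$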



where $W$ is the spectrum bandwidth, and $p_a$ denotes the
additional circuit power consumption of devices during
transmissions (e.g., digital-to-analog converters, analog-to-digital
converters, synthesizers, etc. [20]) and is independent of the
transmission power. Considering the QoS requirement in Eq. (1),
we define the utility function of user $i$ formally as

$$u_i(p_i, \mathbf{p}_{-i}) = \begin{cases} \eta_i(p_i, \mathbf{p}_{-i}), & \text{if } \gamma_i(p_i, \mathbf{p}_{-i}) \geq \gamma_i^*; \\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$
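As a minimal sketch of Eqs. (1), (3) and (4), the following Python functions evaluate the SINR, the energy efficiency and the utility of one user for a given joint power vector. The function and argument names (`gains`, `noise_var`, `circuit_power`, `sinr_min`) are illustrative choices made here and do not come from the paper; `gains[j][i]` is assumed to hold $h_{j,i}$.

```python
import math

def sinr(i, powers, gains, noise_var):
    """SINR of user i at its own base station B_i, following Eq. (1).

    gains[j][i] is assumed to be the channel gain from user j to B_i.
    """
    interference = sum(gains[j][i] * powers[j]
                       for j in range(len(powers)) if j != i)
    return gains[i][i] * powers[i] / (interference + noise_var)

def energy_efficiency(i, powers, gains, noise_var, bandwidth, circuit_power):
    """Rate per unit of consumed power for user i, following Eq. (3)."""
    rate = bandwidth * math.log2(1.0 + sinr(i, powers, gains, noise_var))
    return rate / (circuit_power + powers[i])

def utility(i, powers, gains, noise_var, bandwidth, circuit_power, sinr_min):
    """Utility of user i, following Eq. (4): zero if the SINR target is missed."""
    if sinr(i, powers, gains, noise_var) >= sinr_min:
        return energy_efficiency(i, powers, gains, noise_var,
                                 bandwidth, circuit_power)
    return 0.0
```

With two users, unit direct gains, cross gains of 0.1, unit powers and a noise variance of 0.1, user 0 sees an SINR of 1/(0.1 + 0.1) = 5, so its utility is zero for any SINR target above 5 and equals the energy efficiency otherwise.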



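Eq. (4) demonstrates the interactions among the users. Each user
$i$'s strategy is to choose the power level $p_i$ that maximizes its
utility,

$$\max_{p_i \in \mathcal{P}_i} u_i(p_i, \mathbf{p}_{-i}) \qquad (5)$$

where $\mathcal{P}_i = [p_i^{\min}, p_i^{\max}]$ is the strategy profile of user $i$, with
the upper limit of each FU capped by the power mask, i.e., replaced
by $\min(p_i^{\max}, p_{\text{mask}})$. Particularly, $\mathcal{P}_0 = [p_0^{\min}, p_0^{\max}]$
for the MU.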

In order to improve the energy efficiency, we introduce a
Stackelberg game [21] in the considered networking environment.
A Stackelberg game is a strategic game which consists of
a leader and several followers competing with each other for
some resources. Such a game formulation can be viewed as
an intermediate scheme between the totally centralized power
adaptation strategy and the non-cooperative strategy in [11].
In this paper, the MU representing the whole MUs' coalition
is modeled as the leader, while the FUs are the followers.
Therefore, a distinct hierarchy exists among the users: the
leader plays the game knowing the reaction function of
the followers, and the followers behave competitively given the
actions of the leader.



B. Stackelberg Equilibrium Solution 

Game theory studies the rational interactions among the
players. For the proposed Stackelberg game formulation, the
SE describes an optimal strategy for the MU if all FUs always
respond by playing their Nash equilibrium (NE) strategies in
the smaller sub-game. In order to investigate the existence of
an SE, we first define $p_i^*$ to be the best response to $\mathbf{p}_{-i}$ if

$$u_i(p_i^*, \mathbf{p}_{-i}) \geq u_i(p_i, \mathbf{p}_{-i}), \quad \forall p_i \in \mathcal{P}_i. \qquad (6)$$

User $i$'s best response to $\mathbf{p}_{-i}$ is denoted by $BR_i(\mathbf{p}_{-i})$,
maximizing its utility function subject to the power constraints. Let
$NE(p_0)$ be the NE strategy of the FUs if the MU chooses to
play $p_0$, i.e.,

$$NE(p_0) = \mathbf{p}_{-0}, \quad \text{if } p_i = BR_i(\mathbf{p}_{-i}), \ \forall i \in \mathcal{N} \setminus \{0\}. \qquad (7)$$

Definition 1. The strategy profile $(p_0^*, NE(p_0^*))$ is an SE if
and only if

$$u_0(p_0^*, NE(p_0^*)) \geq u_0(p_0, NE(p_0)), \quad \forall p_0 \in \mathcal{P}_0. \qquad (8)$$
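To make Eqs. (6)-(8) concrete, the sketch below searches for an SE by brute force on a discretized power grid: for every leader power it iterates the followers' best responses until none of them changes, then keeps the leader power with the highest resulting utility. This is only an illustration under assumptions made here: the game in this section is defined over continuous intervals, a pure-strategy fixed point need not exist on a grid (the helper then returns `None`), and all names and default parameter values are hypothetical.

```python
import math

def utility(i, p, gains, noise, sinr_min, bandwidth=1.0, p_circuit=0.1):
    """Utility of user i for the joint power vector p (Eqs. (1), (3), (4))."""
    interference = sum(gains[j][i] * p[j] for j in range(len(p)) if j != i)
    gamma = gains[i][i] * p[i] / (interference + noise)
    if gamma < sinr_min:
        return 0.0
    return bandwidth * math.log2(1.0 + gamma) / (p_circuit + p[i])

def best_response(i, p, levels, gains, noise, sinr_min):
    """Grid power of user i that maximizes its utility against the others, Eq. (6)."""
    return max(levels[i],
               key=lambda x: utility(i, p[:i] + [x] + p[i + 1:],
                                     gains, noise, sinr_min))

def follower_equilibrium(p0, levels, gains, noise, sinr_min, max_rounds=100):
    """Iterate FU best responses for a fixed MU power p0, Eq. (7).

    Returns the joint power vector at a fixed point, or None if the
    iteration does not settle within max_rounds.
    """
    p = [p0] + [lv[0] for lv in levels[1:]]
    for _ in range(max_rounds):
        changed = False
        for i in range(1, len(p)):
            br = best_response(i, p, levels, gains, noise, sinr_min)
            if br != p[i]:
                p[i] = br
                changed = True
        if not changed:
            return p
    return None

def stackelberg_equilibrium(levels, gains, noise, sinr_min):
    """Return (leader utility, joint powers) that maximizes Eq. (8) on the grid."""
    best = None
    for p0 in levels[0]:
        p = follower_equilibrium(p0, levels, gains, noise, sinr_min)
        if p is None:
            continue
        u0 = utility(0, p, gains, noise, sinr_min)
        if best is None or u0 > best[0]:
            best = (u0, p)
    return best
```

Here `levels[0]` lists the candidate powers of the MU and `levels[i]` those of FU `i`. The exhaustive search grows with the size of the leader's grid, which is one practical reason to prefer a learning procedure when the action sets are large or the channel gains are unknown.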



The following theorem establishes the existence of the SE. 

Theorem 1. The SE always exists in our proposed Stack- 
elberg game with the MU leading and the FUs following. 

Proof: In the proposed game formulation, it is not difficult
to see that each FU $i \neq 0$ strictly competes with the other
followers in a non-cooperative fashion, given the MU's action
$\forall p_0 \in \mathcal{P}_0$. Therefore, a smaller non-cooperative power
adaptation sub-game is formulated at the femtocell side,
$G = (p_0, \mathcal{N} \setminus \{0\}, \{\mathcal{P}_i\}, \{u_i\})$. For a non-cooperative game, an NE is
a set of strategies such that no player can benefit by changing
its action unilaterally, assuming the other players continue to use
their current strategies. From the results in [0], [22], there is
at least one NE in the sub-game, since for $\forall i \in \mathcal{N} \setminus \{0\}$

1) the strategy profile $\mathcal{P}_i$ is a non-empty, convex, and
compact subset of some Euclidean space $\mathbb{R}^N$;

2) $u_i$ is continuous in $(p_1, \ldots, p_{i-1}, p_{i+1}, \ldots, p_N)$ and
quasi-concave in $p_i$.

On the other hand, there is only one player at the macrocell
side, and the best response strategy of the MU can be
straightforwardly obtained by solving problem (5). The
above statement is thus proved. ■
We need to point out that in the Stackelberg game, the MU
regards itself as the only leader and performs the Stackelberg
strategy, and the FUs play their best responses until they reach
the equilibrium $(p_0^*, NE(p_0^*))$. The FUs, who are designated
as the followers, are selfish and rational and cannot coordinate
with each other, so they are going to play their best response
strategies $NE(p_0^*)$. Knowing this, the MU, who is designated
as the leader, has to transmit with power level $p_0^*$ to maximize
its utility function.

III. Stackelberg Learning Framework 

In the Stackelberg learning game, each user in the network
behaves as an intelligent agent whose objective is to maximize
its payoff. The payoff is measured by a utility function (e.g.,
Eq. (4)), which reflects the user's satisfaction with executing the
strategy. The Stackelberg learning framework has two levels
of hierarchy: 1) the MU learns to maximize its utility by
knowing the response strategies of all FUs for each possible
play; 2) given the strategy of the MU, the FUs play a
non-cooperative learning game among each other. The game is
played repeatedly to learn the optimal transmission strategies.

A strategy for user $i \in \mathcal{N}$ is defined to be a probability
vector $\boldsymbol{\pi}_i = (\pi_i(p_{i,1}), \ldots, \pi_i(p_{i,m_i})) \in \Pi_i$, where $\pi_i(p_{i,j_i})$
means the probability with which user $i$ chooses action
(transmission power) $p_{i,j_i} \in \mathcal{P}_i$, and $\Pi_i$ is the strategy set
available to user $i$. Since each user can only choose a power
level from a finite discrete set, $\mathcal{P}_i$ is assumed to be a finite set
with dimension $m_i$. The expected utility function $U_i$ for
user $i$ can then be expressed as follows

$$U_i(\boldsymbol{\pi}_i, \boldsymbol{\pi}_{-i}) = E\left[u_i \mid \text{user } j \text{ plays strategy } \boldsymbol{\pi}_j,\ j \in \mathcal{N}\right] = \sum_{\mathbf{p} \in \mathcal{P}} u_i(\mathbf{p}) \prod_{j \in \mathcal{N}} \pi_j(p_{j,j_j}) \qquad (9)$$

where $\boldsymbol{\pi}_{-i} = (\boldsymbol{\pi}_0, \ldots, \boldsymbol{\pi}_{i-1}, \boldsymbol{\pi}_{i+1}, \ldots, \boldsymbol{\pi}_N)$ is a vector of
strategies for all other users, $\mathbf{p} = (p_{0,j_0}, \ldots, p_{N,j_N})$ is the
vector of actions chosen by all users, and $\mathcal{P} = \times_{i \in \mathcal{N}} \mathcal{P}_i$ is the
space of all action vectors. An action is a power level
used by the user, and we use "action" and "transmission
power level" interchangeably in the following discussions.
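The expected utility in Eq. (9) is a sum over all joint action vectors, weighted by the product of the users' independent selection probabilities. A minimal sketch of that computation, assuming a caller-supplied `utility_fn(i, powers)` (a hypothetical name) that returns $u_i$ for one joint power vector:

```python
from itertools import product

def expected_utility(i, strategies, action_sets, utility_fn):
    """Expected utility of user i under independent mixed strategies, Eq. (9).

    strategies[j][a] is the probability that user j plays action_sets[j][a].
    utility_fn(i, powers) returns u_i for one joint power vector.
    """
    total = 0.0
    index_ranges = [range(len(actions)) for actions in action_sets]
    for joint in product(*index_ranges):
        # Probability of this joint action under independent strategies.
        prob = 1.0
        for j, a in enumerate(joint):
            prob *= strategies[j][a]
        if prob == 0.0:
            continue
        powers = [action_sets[j][a] for j, a in enumerate(joint)]
        total += prob * utility_fn(i, powers)
    return total
```

The number of terms is the product of the action set sizes, so exact evaluation is only practical for small networks; this is part of why the followers in the later sections rely on sampled estimates instead.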

In the same way, we may have the following definition of 
SE in the proposed Stackelberg learning game. 

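Definition 2. For any stationary strategy² of the MU,
$\boldsymbol{\pi}_0 \in \Pi_0$, the best-response strategies of all FUs define an
NE strategy $NE(\boldsymbol{\pi}_0)$, i.e., $NE(\boldsymbol{\pi}_0) = \boldsymbol{\pi}_{-0}^*$, if

$$\boldsymbol{\pi}_i^* = \arg\max_{\boldsymbol{\pi}_i \in \Pi_i} U_i(\boldsymbol{\pi}_i, \boldsymbol{\pi}_{-i}), \quad \forall i \in \mathcal{N} \setminus \{0\}. \qquad (10)$$

The MU's optimal strategy is then

$$\boldsymbol{\pi}_0^* = \arg\max_{\boldsymbol{\pi}_0 \in \Pi_0} U_0(\boldsymbol{\pi}_0, NE(\boldsymbol{\pi}_0)). \qquad (11)$$

Together, $(\boldsymbol{\pi}_0^*, NE(\boldsymbol{\pi}_0^*))$ constitute a stationary strategy of SE
for the Stackelberg learning formulation.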

Theorem 2. For the proposed Stackelberg learning game,
there exist a stationary strategy of the MU and an NE strategy
of the FUs that form an SE.

Inspired by [23], we can prove Theorem 2 as follows.

Proof: If the MU follows a stationary strategy $\boldsymbol{\pi}_0 \in \Pi_0$, then
the Stackelberg learning game reduces to an $N$-player
stochastic learning game for the FUs. It has been shown in
[21] that every finite strategic-form game has a mixed strategy
equilibrium. In other words, there always exists a stationary
$NE(\boldsymbol{\pi}_0)$ that is a best response for all the FUs in our formulation
of the stochastic power adaptation process. The rest of the
proof follows directly from the definition of SE, and is thus
omitted for brevity. ■

Therefore, if we can construct an asymptotically (with
time $t$) stationary strategy $\{\boldsymbol{\pi}_i^t \mid i \in \mathcal{N}\}$ converging to the SE
$(\boldsymbol{\pi}_0^*, NE(\boldsymbol{\pi}_0^*))$, we will achieve the main goal of the Stackelberg
learning power adaptation game in this paper. In the rest of
this paper, we focus on how to reach the optimal
communication configuration through a reinforcement learning
approach.

² A strategy is said to be stationary if $\boldsymbol{\pi}_i = (\pi_i(1), \ldots, \pi_i(m_i))$ does
not change with time during the stochastic learning process.



IV. Reinforcement Learning based Hierarchical 
Power Adaptation-I (RLHPA-I) 

A. Reinforcement Learning based Algorithm 

During the Stackelberg learning process, the MU acts
as the leader and knows the transmission strategy
information of all FUs. Users with learning ability learn
to maximize their individual expected utility functions through
repeated interactions with the surrounding networking
environment. Among many different implementations of the above
adaptation mechanism, in this paper we consider reinforcement
learning in the form of Q-learning [24], [25], where
the users' strategies are parameterized through Q-functions
that characterize the relative expected utility of a particular
power level. In Q-learning, users try to find the optimal
Q-values in a recursive way. More specifically, let $Q_i^t(p_{i,j_i})$
denote the Q-value of user $i$'s power level $p_{i,j_i}$
at time $t$. Then, after performing the transmission power level
$p_{i,j_i}$ according to its strategy $\boldsymbol{\pi}_i^t$ at time slot $t$, the Q-value is
updated via the following rule

$$Q_i^{t+1}(p_{i,j_i}) = (1 - \alpha^t)\, Q_i^t(p_{i,j_i}) + \alpha^t\, U_i(p_{i,j_i}, \boldsymbol{\pi}_{-i}^t), \qquad (12)$$

where $\alpha^t \in [0, 1)$ is the learning rate,
$\boldsymbol{\pi}_{-i}^t = (\boldsymbol{\pi}_0^t, \ldots, \boldsymbol{\pi}_{i-1}^t, \boldsymbol{\pi}_{i+1}^t, \ldots, \boldsymbol{\pi}_N^t)$ is the vector of other
users' strategies at time $t$, and

$$U_i(p_{i,j_i}, \boldsymbol{\pi}_{-i}^t) = \sum_{\mathbf{p}_{-i} \in \mathcal{P}_{-i}} u_i(p_{i,j_i}, \mathbf{p}_{-i}) \prod_{s \in \mathcal{N} \setminus \{i\}} \pi_s^t(p_{s,j_s}). \qquad (13)$$

Here $\mathbf{p}_{-i} = (p_{0,j_0}, \ldots, p_{i-1,j_{i-1}}, p_{i+1,j_{i+1}}, \ldots, p_{N,j_N})$ is a
vector of actions chosen by all users except user $i$ over the
action space $\mathcal{P}_{-i} = \times_{s \in \mathcal{N} \setminus \{i\}} \mathcal{P}_s$.

The tradeoff between exploration and exploitation is a
challenging issue in stochastic learning processes. The goal of
exploration is to continually try new actions, while exploitation
aims to "capitalize" on already established actions. One key
feature of reinforcement learning is that it explicitly considers
exploration and exploitation in an integrated way, such that
the users not only reinforce the actions they already know to
be good but also explore new ones. In general, one deals with
this problem by using a probabilistic method for choosing
actions; e.g., $\epsilon$-greedy selection [26] is an effective approach
to balancing exploration and exploitation. One drawback,
however, is that it might lead to a globally suboptimal solution.
Thus, we need to incorporate some way of exploring less-optimal
actions.

An alternative solution is to vary the action probabilities as
a graded function of the Q-values. The most common method
is to use a Boltzmann distribution, that is, the probability of
choosing transmission power level $p_{i,j_i}$ at time $t + 1$ is given
by

$$\pi_i^{t+1}(p_{i,j_i}) = \frac{\exp\left(Q_i^t(p_{i,j_i}) / \tau_i\right)}{\sum_{p_{i,j} \in \mathcal{P}_i} \exp\left(Q_i^t(p_{i,j}) / \tau_i\right)} \qquad (14)$$

where $\tau_i$ is a positive parameter called the temperature, which
controls the exploration/exploitation tradeoff [11]. A high
temperature causes the action selection probabilities to be all
nearly equal, while a low temperature results in a big difference
in selection probabilities for actions that differ in their Q-values.
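The Boltzmann rule in Eq. (14) can be sketched in a few lines of Python. Subtracting the largest Q-value before exponentiating is a standard numerical safeguard added here (it leaves the probabilities unchanged); the function names are illustrative.

```python
import math
import random

def boltzmann_strategy(q_values, temperature):
    """Action probabilities from Q-values via the Boltzmann rule, Eq. (14)."""
    q_max = max(q_values)  # shift for numerical stability; cancels in the ratio
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def sample_action(strategy, rng=random):
    """Draw an action index according to the probability vector `strategy`."""
    return rng.choices(range(len(strategy)), weights=strategy, k=1)[0]
```

For Q-values (1, 2, 3), a very large temperature gives probabilities close to 1/3 each, whereas a temperature of 0.01 places almost all of the mass on the third action, which matches the behavior described above.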
From Eq. (12) and Eq. (13), we can see that every user
$i$'s updating rule depends on the strategies of other users.
The MU, who has the role of leader, can learn the optimal
strategy according to Eq. (12) and Eq. (14). However, as a
follower, each FU $i \in \mathcal{N} \setminus \{0\}$ can know neither
the other competing FUs' private strategy information
$\boldsymbol{\pi}_{-(0,i)}^t = (\boldsymbol{\pi}_1^t, \ldots, \boldsymbol{\pi}_{i-1}^t, \boldsymbol{\pi}_{i+1}^t, \ldots, \boldsymbol{\pi}_N^t)$ nor the utility value
$u_i(p_{i,j_i}, \mathbf{p}_{-i})$ before performing the action $p_{i,j_i}$. The only
information it has is the MU's transmission parameters, i.e.,
the selected transmission power levels. Thus the updating rule
for FU $i$ is transformed to

$$Q_i^{t+1}(p_{i,j_i}) = (1 - \alpha_i^t)\, Q_i^t(p_{i,j_i}) + \alpha_i^t\, U_i(p_{0,j_0}, p_{i,j_i}, \boldsymbol{\pi}_{-(0,i)}^t), \qquad (15)$$

where $\alpha_i^t \in [0, 1)$ is the learning rate for the FUs, and

$$U_i(p_{0,j_0}, p_{i,j_i}, \boldsymbol{\pi}_{-(0,i)}^t) = \sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)}) \prod_{s \in \mathcal{N} \setminus \{0,i\}} \pi_s^t(p_{s,j_s}) \qquad (16)$$

and $\mathbf{p}_{-(0,i)} = (p_{1,j_1}, \ldots, p_{i-1,j_{i-1}}, p_{i+1,j_{i+1}}, \ldots, p_{N,j_N})$ is a
vector of actions chosen by all FUs except FU $i$ over the action
space $\mathcal{P}_{-(0,i)} = \times_{s \in \mathcal{N} \setminus \{0,i\}} \mathcal{P}_s$.

On the other hand, each FU $i \in \mathcal{N} \setminus \{0\}$ is able to
compute the attained utility $u_i(p_{i,j_i}, \mathbf{p}_{-i})$ with the feedback
information (Eq. (1)) from its intended receiver. Under the
Stackelberg learning framework, the MU behaves as the leader
and makes decisions first. It is therefore assumed that the
MU makes decisions every $T$ ($> 1$) time slots, which is also
defined as one episode. After each action is executed by the
MU, all FUs repeatedly play the non-cooperative learning
game during the episode. Suppose that the MU selects power
level $p_{0,j_0}$ according to its strategy $\boldsymbol{\pi}_0^k$ in episode $k$. The
expected utility $U_i(p_{0,j_0}, p_{i,j_i}, \boldsymbol{\pi}_{-(0,i)}^t)$ at time slot $t = (k - 1)T + t_e$
($t_e = 1, \ldots, T$) can then be estimated using the recursion in Eq. (17),
where $\mathbf{p}_{-(0,i)}^t = (p_{1,j_1}^t, \ldots, p_{i-1,j_{i-1}}^t, p_{i+1,j_{i+1}}^t, \ldots, p_{N,j_N}^t)$ is
the vector of actions chosen by all other FUs except FU $i$ at
time slot $t$, and $n_i^{k,t_e-1}(p_{i,j_i})$ is the number of times
FU $i$ has selected power level $p_{i,j_i}$ up to time $t_e - 1$ in episode $k$.

At any time slot $t_e \in \{1, \ldots, T\}$ in each episode $k$, each FU
$i \in \mathcal{N} \setminus \{0\}$ is always supposed to know its own and the MU's
actions. Substituting Eq. (17) into Eq. (15), the Q-learning
rule for FU $i$ can then be rewritten, with $t = (k - 1)T + t_e$, as

$$Q_i^{t+1}(p_{i,j_i}) = (1 - \alpha_i^t)\, Q_i^t(p_{i,j_i}) + \alpha_i^t\, U_i^t(p_{0,j_0}, p_{i,j_i}). \qquad (18)$$
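The follower-side update can be illustrated with a small class that keeps, for one fixed MU action, a running average of the realized utilities of each power level (the recursion referred to as Eq. (17)) and feeds it into the Q-update of Eq. (18). This is a sketch under assumptions made here: the learning rate is held constant, only the Q-value of the action just played is updated, and the class and attribute names are hypothetical.

```python
class FollowerLearner:
    """Learning state of one FU for a fixed MU action (a sketch)."""

    def __init__(self, num_actions, learning_rate):
        self.alpha = learning_rate          # assumed constant over time
        self.q = [0.0] * num_actions        # Q-values, one per power level
        self.reset_episode()

    def reset_episode(self):
        # Utility estimates and selection counts restart in every episode.
        self.u_est = [0.0] * len(self.q)
        self.count = [0] * len(self.q)

    def observe(self, action, realized_utility):
        """Process the utility realized after playing `action` in this slot."""
        # Eq. (17): incremental mean over the slots in which `action` was played.
        self.u_est[action] += ((realized_utility - self.u_est[action])
                               / (self.count[action] + 1))
        self.count[action] += 1
        # Eq. (18): Q-update driven by the estimated expected utility.
        self.q[action] = ((1.0 - self.alpha) * self.q[action]
                          + self.alpha * self.u_est[action])
```

For example, with a learning rate of 0.5, observing utilities 4 and then 2 for the same action yields estimates 4 and 3, and Q-values 2 and 2.5.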

The MU's learning algorithm resembles standard
single-agent Q-learning, except that the expected
utility is the utility accrued over one episode (i.e., $T$ time
slots), that is,

$$Q_0^{k+1}(p_{0,j_0}) = (1 - \alpha_0^k)\, Q_0^k(p_{0,j_0}) + \alpha_0^k\, U_0^k(p_{0,j_0}), \qquad (19)$$

where $\alpha_0^k \in [0, 1)$ is the learning rate for the MU, and

$$U_0^k(p_{0,j_0}) = \frac{1}{T} \sum_{t_e \in \{1, \ldots, T\}} U_0\left(p_{0,j_0}, \boldsymbol{\pi}_{-0}^{(k-1)T + t_e}\right). \qquad (20)$$

Accordingly, the strategy updates in Eq. (14) for the MU
and the FUs are based on different time scales. Now we
present the first reinforcement learning based hierarchical
power adaptation algorithm for the Stackelberg learning game.



RLHPA-I

Initialization:

1) Set $t = 1$ (such that $k = 1$) and initialize the Q-values $Q_i^1(p_{i,j_i})$ for
each user $i \in \mathcal{N}$ and each action $p_{i,j_i} \in \mathcal{P}_i$.

Learning:

2) In episode $k$, the MU chooses action $p_{0,j_0}$ according
to $\boldsymbol{\pi}_0^k$ and broadcasts this information to all FUs in the
network.

3) Set $U_i^{(k-1)T}(p_{0,j_0}, p_{i,j_i}) = 0$ for each FU $i \in \mathcal{N} \setminus \{0\}$
and each action $p_{i,j_i} \in \mathcal{P}_i$. For $t = (k - 1)T + 1, \ldots, kT$,
do:

(3.1) FU $i$ selects an action $p_{i,j_i}$ according to $\boldsymbol{\pi}_i^t$ and sends
its relevant strategy information to the macrocell.

(3.2) All users measure their SINR $\gamma_i$ with the feedback
information of the intended receiver. If $\gamma_i \geq \gamma_i^*$, then
$\eta_i(\mathbf{p})$ can be achieved; otherwise, the receiver cannot
receive correctly and the user obtains zero utility.

(3.3) The MU calculates $U_0(p_{0,j_0}, \boldsymbol{\pi}_{-0}^t)$ according to Eq. (13).

(3.4) All FUs update $U_i^t(p_{0,j_0}, p_{i,j_i})$ based on Eq. (17).

(3.5) All FUs update the Q-values $Q_i^{t+1}(p_{i,j_i})$ according to Eq. (18).

(3.6) All FUs update the strategies $\pi_i^{t+1}(p_{i,j_i})$ according to
Eq. (14).

(3.7) Set $t = t + 1$.

4) The MU calculates $U_0^k(p_{0,j_0})$ according to Eq. (20).

5) The MU updates the Q-values $Q_0^{k+1}(p_{0,j_0})$ according to Eq. (19).

6) The MU updates the strategies $\pi_0^{k+1}(p_{0,j_0})$ according
to Eq. (14).

7) Set $k = k + 1$.

End Learning
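The steps of RLHPA-I can be sketched as a compact Python simulation. It is an illustration under assumptions made here rather than the authors' implementation: a single constant learning rate `alpha` and temperature `tau` are shared by all users, each FU keeps one Q-vector per MU action, only the Q-value of the action just played is updated, and the default parameter values are arbitrary.

```python
import itertools
import math
import random

def softmax(q, tau):
    """Boltzmann strategy of Eq. (14), shifted for numerical stability."""
    m = max(q)
    w = [math.exp((x - m) / tau) for x in q]
    s = sum(w)
    return [x / s for x in w]

def utility(i, p, gains, noise, sinr_min, bandwidth=1.0, p_circuit=0.1):
    """Utility of user i for joint powers p, Eqs. (1), (3), (4)."""
    interf = sum(gains[j][i] * p[j] for j in range(len(p)) if j != i)
    g = gains[i][i] * p[i] / (interf + noise)
    return bandwidth * math.log2(1 + g) / (p_circuit + p[i]) if g >= sinr_min else 0.0

def rlhpa1(levels, gains, noise, sinr_min, episodes=200, T=20,
           alpha=0.1, tau=0.5, seed=0):
    """levels[i] lists the discrete power levels of user i (user 0 is the MU)."""
    rng = random.Random(seed)
    fus = range(1, len(levels))
    q0 = [0.0] * len(levels[0])                                  # step 1
    qf = {i: [[0.0] * len(levels[i]) for _ in levels[0]] for i in fus}
    for _ in range(episodes):
        pi0 = softmax(q0, tau)
        j0 = rng.choices(range(len(pi0)), weights=pi0)[0]        # step 2
        u_est = {i: [0.0] * len(levels[i]) for i in fus}         # step 3
        cnt = {i: [0] * len(levels[i]) for i in fus}
        u0_sum = 0.0
        for _ in range(T):
            pis = {i: softmax(qf[i][j0], tau) for i in fus}
            acts = {i: rng.choices(range(len(pis[i])), weights=pis[i])[0]
                    for i in fus}                                # step 3.1
            p = [levels[0][j0]] + [levels[i][acts[i]] for i in fus]
            # Step 3.3: MU's expected utility over the FUs' strategies, Eq. (13).
            for joint in itertools.product(*[range(len(levels[i])) for i in fus]):
                prob = math.prod(pis[i][a] for i, a in zip(fus, joint))
                pj = [levels[0][j0]] + [levels[i][a] for i, a in zip(fus, joint)]
                u0_sum += prob * utility(0, pj, gains, noise, sinr_min)
            for i in fus:
                a = acts[i]
                u = utility(i, p, gains, noise, sinr_min)            # step 3.2
                u_est[i][a] += (u - u_est[i][a]) / (cnt[i][a] + 1)   # step 3.4
                cnt[i][a] += 1
                qf[i][j0][a] = ((1 - alpha) * qf[i][j0][a]
                                + alpha * u_est[i][a])               # step 3.5
        u0 = u0_sum / T                                          # step 4, Eq. (20)
        q0[j0] = (1 - alpha) * q0[j0] + alpha * u0               # step 5, Eq. (19)
    return softmax(q0, tau), {i: [softmax(q, tau) for q in qf[i]] for i in fus}
```

The function returns the MU's final strategy and, for each FU, one strategy per MU action. Steps 3.6 and 6 are implicit: strategies are recomputed from the Q-values at the start of each slot and each episode.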



The parameter T decides the number of time slots that all 
FUs play the game before the MU updates its transmission 
strategy. Note that the updating rules of the MU and the FUs 
happen at different time scales. The FUs' Q-values are updated 
in every time slot whereas for the MU, the update happens 
only once in T slots. 

B. Discussion of RLHPA-I 

From the definition of SE, it is clear that the convergence 
of RLHPA-I to an SE requires that the MU's learning process 
converges to the optimal strategy while the FUs' stochastic 
behaviors converge to the corresponding NE under this optimal 
strategy. In this subsection, we discuss the conditions for such 
a convergence. 

$$U_i^{t}(p_{0,j_0}, p_{i,j_i}) = \begin{cases} U_i^{t-1}(p_{0,j_0}, p_{i,j_i}) + \dfrac{u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)}^t) - U_i^{t-1}(p_{0,j_0}, p_{i,j_i})}{n_i^{k,t_e-1}(p_{i,j_i}) + 1}, & \text{if FU } i \text{ selects } p_{i,j_i} \text{ at time slot } t; \\ U_i^{t-1}(p_{0,j_0}, p_{i,j_i}), & \text{otherwise.} \end{cases} \qquad (17)$$

As already discussed, in the Stackelberg learning game we
propose, the FUs behave as the followers in a smaller sub-game
given the transmission strategy of the MU. In other
words, for each strategy of the MU, the FUs have a multi-agent
reinforcement learning problem in which the goal is to
learn the NE of the game. In our algorithm RLHPA-I, however,
the FUs have independent learning processes that run
simultaneously, one for each action of the
MU. We use identical single-agent learning schemes for these
processes and, as already noted, the FUs maintain separate
and private Q-values for each of these processes. Each of these
learning processes proceeds for $T$ time slots whenever the
MU makes a decision according to its transmission strategy.
That means each FU, like the MU, is equipped with a standard
single-agent reinforcement learning algorithm. Given that
a sufficient number of trials of the power levels have been
executed, the FUs in our algorithm will converge to the NE
corresponding to the MU's different transmission strategies.

The following Lemma by Szepesvári and Littman [27]
establishes the convergence of a general single-agent Q-learning
process updated by a pseudo-contraction operator. Let $\mathcal{Q}$ be
the space of all Q-values.

Lemma. Assume that the learning rate $\alpha^t$ in Eq. (21)
satisfies the sufficient conditions of the Theorem in [24], and
the mapping $\mathcal{H}^t : \mathcal{Q} \to \mathcal{Q}$ meets the following condition:
there exists a number $0 < \beta < 1$ and a sequence $x^t \geq 0$
converging to zero with probability (w.p.) 1 as $t \to \infty$, such
that $\|\mathcal{H}^t Q^t - \mathcal{H}^t Q^*\| \leq \beta \|Q^t - Q^*\| + x^t$ for all $Q^t \in \mathcal{Q}$
and $Q^* = E[\mathcal{H}^t Q^*]$. Then the iteration defined by

$$Q^{t+1} = (1 - \alpha^t)\, Q^t + \alpha^t \left(\mathcal{H}^t Q^t\right) \qquad (21)$$

converges to $Q^*$ w.p. 1.

Theorem 3. RLHPA-I will always discover an SE strategy. 

Proof: We prove this by contradiction. Suppose that the
process generated by Eq. (21) converges to a non-Stackelberg
equilibrium. From the previous discussion, we know that the
long-term behavior of RLHPA-I converges to stationary points. This
means that stationary points that are not SEs are stable, which
contradicts Theorem 2. ■

Note that, unlike in conventional single-agent reinforcement
learning, in the considered Stackelberg learning
problem the MU's payoff value for performing a particular
action depends on the outcome of a sub-game, which
is played by the non-cooperative FUs in response to the
MU's decision. When the FUs are in the process of learning
their own transmission strategies, the outcomes of the smaller
sub-games, and consequently the utility values achieved by
the MU, can typically be non-stationary. With non-stationary
payoffs, the Lemma may not apply. In order to tackle this, we
adopt an averaging procedure in our algorithm, as indicated by
Eq. (20). At each updating step, the MU uses $U_0^k(p_{0,j_0})$, the
averaged expected utility from $T$ non-cooperative learning
games of the FUs. This provides the MU with utilities that
are good approximations of the payoffs corresponding to the
outcomes of the non-cooperative sub-game.

V. Reinforcement Learning based Hierarchical 
Power Adaptation-II (RLHPA-II) 

In order to promote potential cooperation among the com- 
peting FUs, we further propose a simple and intuitive rule 
that each FU links its own current transmission strategy to the 
other FUs' strategies. Such a rule reflects an awareness that 
there are strategic interactions during the learning procedure. 
FUs with such beliefs may not correctly perceive how the 
future strategies of their competitors depend on the past. In 
this section, we propose a conjecture model concerning the 
way in which the FUs react to each other. 

A. Conjecture Model 

Each FU $i \in \mathcal{N} \setminus \{0\}$ thinks any change in its current
transmission strategy will induce other competing FUs to
make well-defined changes in the corresponding time slot.
Specifically, we need to estimate FU $i$'s expected contention
measure at time slot $t = (k - 1)T + t_e$, i.e.,
$b_i^t(\mathbf{p}_{-(0,i)}) = \prod_{s \in \mathcal{N} \setminus \{0,i\}} \pi_s^t(p_{s,j_s})$ in Eq. (16), through a conjectural belief
$\tilde{b}_i^t(\mathbf{p}_{-(0,i)})$, which is expressed as

$$\tilde{b}_i^t(\mathbf{p}_{-(0,i)}) = \bar{b}_i(\mathbf{p}_{-(0,i)}) - \delta_i \left(\pi_i^t(p_{i,j_i}) - \bar{\pi}_i(p_{i,j_i})\right), \qquad (22)$$

where the so-called reference points [28], $\bar{b}_i(\mathbf{p}_{-(0,i)})$ and
$\bar{\pi}_i(p_{i,j_i})$, are a specific belief and probability, and $\delta_i > 0$ is
the belief factor. The reference points are considered as
exogenously given. In other words, every FU $i$ believes that a change
of $\pi_i^t(p_{i,j_i}) - \bar{\pi}_i(p_{i,j_i})$ in its own strategy at time $t$ will induce a
change of $\delta_i \left(\pi_i^t(p_{i,j_i}) - \bar{\pi}_i(p_{i,j_i})\right)$ in the expected contention
measure correspondingly related to the transmission strategies
of other FUs. It is necessary to point out here that although FU
$i$ may be aware that other FUs are subject to many influences
on their strategies, when making its own decision it is only
concerned with other FUs' reactions to itself. That means FU $i$
does not take into account whether or not FU $s$ ($s \in \mathcal{N} \setminus \{0, i\}$)
might react to changes in transmission strategy made by FU
$v$ ($v \in \mathcal{N} \setminus \{0, i, s\}$).

Among different possibilities of capturing the expected
contention measure $\tilde{b}_i^t(\mathbf{p}_{-(0,i)})$, the linear
model represented in Eq. (22) is the simplest form based on
which one FU can model the impact of its changes in
transmission strategy on the other competing FUs. In the
non-cooperative learning process, as intelligent agents, the
FUs learn when they modify the beliefs based on the new
achievements. More specifically, we allow the FUs to revise
their reference points according to their previous
observations. That is, each FU i sets
$\bar{b}_i(\mathbf{p}_{-(0,i)})$ and $\bar{\pi}_i(p_{i,j_i})$ to be
$b_i^{t-1}(\mathbf{p}_{-(0,i)})$ and $\pi_i^{t-1}(p_{i,j_i})$.
Eq. (22) then becomes

$$\tilde{b}_i^t(\mathbf{p}_{-(0,i)}) = b_i^{t-1}(\mathbf{p}_{-(0,i)}) - \delta_i \left( \pi_i^t(p_{i,j_i}) - \pi_i^{t-1}(p_{i,j_i}) \right). \quad (23)$$

The conjecture model deployed by the FUs is based on the
concept of reciprocity, which refers to the interaction
mechanisms in which the FUs repeatedly interact when choosing
the power level. If they realize that the probability of
interacting with each other in the future is high, they will
consider their influence on the strategies of other FUs, which
is captured in the conjecture model by the positive parameter
$\delta_i$. Otherwise, they will act myopically, which is the
same learning process as in Section IV.
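
The linear conjecture of Eq. (23) can be illustrated numerically. The following Python fragment is a minimal sketch, assuming a scalar belief for a single action profile of the competing FUs; the function name `conjectural_belief` and the clipping of the result to [0, 1] are assumptions introduced here for illustration and are not part of the model above.

```python
def conjectural_belief(prev_belief, pi_now, pi_prev, delta):
    """Conjectured contention measure of one FU, using the linear form of Eq. (23).

    prev_belief: contention measure observed in the previous time slot
    pi_now, pi_prev: the FU's own probability of a power level at t and t-1
    delta: positive belief factor
    """
    raw = prev_belief - delta * (pi_now - pi_prev)
    # Assumption: clip so the result stays a valid probability.
    return min(1.0, max(0.0, raw))


# Raising the own probability by 0.1 with delta = 2 lowers the conjectured
# contention measure by 0.2.
lowered = conjectural_belief(0.5, 0.6, 0.5, 2.0)
# With delta = 0 the FU ignores its own influence and keeps the observed value.
myopic = conjectural_belief(0.5, 0.6, 0.5, 0.0)
```

Setting the belief factor to zero in this sketch recovers the myopic behaviour, in which the FU simply reuses the contention measure observed in the previous time slot.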

B. Conjecture based Reinforcement Learning Scheme 

Following the previous analysis, the Q-learning rule for FU
i given by Eq. (13) is thus modified as Eq. (24). Therefore, we
propose the second reinforcement learning based hierarchical
power adaptation algorithm, RLHPA-II, to discover the SE
strategy. RLHPA-II is quite similar to RLHPA-I, except that
the FUs update their Q-values based on Eq. (24) rather than
on Eq. (12) as in RLHPA-I.

The detailed description of RLHPA-II is given as follows. 



RLHPA-II 



Initialization: 

1) Set t = 1 (such that k = 1), and initialize the Q-values for
each user $i \in \mathcal{N}$ and each action
$p_{i,j_i} \in \mathcal{P}_i$, and the belief factors $\delta_i$
for each FU $i \in \mathcal{N}\setminus\{0\}$.

Learning:

2) In episode k, the MU chooses action $p_{0,j_0}$ according to
$\pi_0^k$, and the MBS broadcasts this information to all FUs
in the network.

3) For t = (k-1)T + 1, ..., kT, do:

(3.1) FU i selects an action $p_{i,j_i}$ according to $\pi_i^t$
and sends its relevant strategy information to the macrocell.

(3.2) All users measure their SINR $\gamma_i$ with the feedback
information of the intended receiver. If
$\gamma_i \geq \gamma_i^*$, then $\eta_i(\mathbf{p})$ can be
achieved; otherwise, the receiver cannot receive correctly
and thus obtains zero utility value.

(3.3) The MU calculates $U_0(p_{0,j_0}, \boldsymbol{\pi}_{-0}^t)$
according to Eq. (13).

(3.4) The MBS broadcasts strategy information
$\boldsymbol{\pi}_{-0}^{t-1}$ to all FUs.

(3.5) All FUs update $\tilde{b}_i^t(\mathbf{p}_{-(0,i)})$ based on
Eq. (23).

(3.6) All FUs update Q-values $Q_i^{t+1}(p_{i,j_i})$ according to
Eq. (24).

(3.7) All FUs update the strategies $\pi_i^{t+1}(p_{i,j_i})$
according to Eq. (14).

(3.8) Set t = t + 1.

4) The MU calculates $U_0^k(p_{0,j_0})$ according to Eq. (20).

5) The MU updates Q-values $Q_0^{k+1}(p_{0,j_0})$ according to
the same Q-learning rule as in RLHPA-I.

6) The MU updates the strategies $\pi_0^{k+1}(p_{0,j_0})$
according to Eq. (14).

7) Set k = k + 1.
End Learning 



It's worth mentioning that during the learning process, every
FU utilizes the other FUs' strategy information from the
previous time slot. Unlike in RLHPA-I, the FUs in RLHPA-II
have multi-agent learning processes that relate to each other
and run simultaneously.
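
The follower side of the learning loop (steps 3.1 to 3.8) can be sketched in a few lines of Python. This is only an illustration under strong assumptions: two symmetric FUs, the MU's action held fixed, a hypothetical `toy_utility` in place of the actual utility function, and an expectation-based Q-update that moves each Q-value towards the conjectured expected utility. None of these names or simplifications come from the algorithm listing above.

```python
import math


def boltzmann(q_values, tau):
    """Boltzmann (softmax) probabilities over power levels."""
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def toy_utility(own_power, other_power):
    """Hypothetical utility: a rate-like term divided by the transmit power."""
    return math.log2(1.0 + own_power / (1.0 + other_power)) / own_power


def follower_episode(levels, slots=100, tau=5.0, delta=2.0, alpha=0.1):
    """Run one episode of conjecture-based learning for two symmetric FUs."""
    n = len(levels)
    q = [[0.0] * n for _ in range(2)]
    pi_prev = [[1.0 / n] * n for _ in range(2)]
    for _ in range(slots):
        # Strategies of both FUs for this slot, computed before any update.
        pi = [boltzmann(q[i], tau) for i in range(2)]
        for i in range(2):
            rival = 1 - i
            for a in range(n):
                # Conjectured reaction of the rival to FU i's strategy change.
                shift = delta * (pi[i][a] - pi_prev[i][a])
                expected = 0.0
                for r in range(n):
                    belief = min(1.0, max(0.0, pi_prev[rival][r] - shift))
                    expected += belief * toy_utility(levels[a], levels[r])
                # Assumed Q-update: move towards the conjectured expected utility.
                q[i][a] += alpha * (expected - q[i][a])
        pi_prev = pi
    return q, [boltzmann(q[i], tau) for i in range(2)]


# Power levels of 20, 25 and 30 dBm expressed in watts (rounded).
levels_w = [0.1, 0.316, 1.0]
q_final, pi_final = follower_episode(levels_w)
```

Because both FUs start from identical values and update from the same slot-level strategies, the sketch keeps their Q-values identical; a realistic setting would break this symmetry through different channel gains.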

C. Theoretical Analysis of RLHPA-II 

Next, we concentrate on analyzing the convergence property 
of the RLHPA-II. The algorithm results in a stochastic process 
of obtaining the vector of action selection probabilities, so 
we need to characterize the long-term behaviors of all users. 
Along with the discussion in Section IV-B, it only remains for
us to prove the convergence of the FUs' stochastic behavior in
each episode k, given that T is large enough. For an N-FU
stochastic learning game, we define the operator
$\mathcal{H}^{t_e}$ as follows.

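Definition 3. Let $Q^{t_e} = (Q_1^{t_e}, \dots, Q_N^{t_e})$, where
$Q_i^{t_e} \in \mathcal{Q}_i$ for $i \in \mathcal{N}\setminus\{0\}$, and
$\mathcal{Q} = \prod_{i \in \mathcal{N}\setminus\{0\}} \mathcal{Q}_i$.
$\mathcal{H}^{t_e}: \mathcal{Q} \to \mathcal{Q}$ is a mapping on the
complete metric space $\mathcal{Q}$ into $\mathcal{Q}$,
$\mathcal{H}^{t_e} Q^{t_e} = (\mathcal{H}^{t_e} Q_1^{t_e}, \dots, \mathcal{H}^{t_e} Q_N^{t_e})$,
where

$$\mathcal{H}^{t_e} Q_i^{t_e}(p_{i,j_i}) = \sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)}) \, \tilde{b}_i^{t}(\mathbf{p}_{-(0,i)}). \quad (25)$$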

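Then we proceed to prove that $Q^* = E[\mathcal{H}^{t_e} Q^*]$.

Proposition 1. For an N-FU stochastic game,

$$Q^* = E[\mathcal{H}^{t_e} Q^*], \quad (26)$$

where $Q^* = (Q_1^*, \dots, Q_N^*)$.

Proof: Since for $\forall i \in \mathcal{N}\setminus\{0\}$,

$$Q_i^*(p_{i,j_i}) = \sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)}) \prod_{s \in \mathcal{N}\setminus\{0,i\}} \pi_s^*(p_{s,j_s}). \quad (27)$$

From the discussions in Section V-A, we have
$\tilde{b}_i^*(\mathbf{p}_{-(0,i)}) = \prod_{s \in \mathcal{N}\setminus\{0,i\}} \pi_s^*(p_{s,j_s})$.
Thus,

$$Q_i^*(p_{i,j_i}) = E[\mathcal{H}^{t_e} Q_i^*(p_{i,j_i})], \quad (28)$$

for all $p_{i,j_i} \in \mathcal{P}_i$. ■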

We further define the distance between any two Q-values.

Definition 4. For any $Q, Q' \in \mathcal{Q}$, we define

$$\|Q - Q'\| = \max_{i \in \mathcal{N}\setminus\{0\}} \max_{p_{i,j_i} \in \mathcal{P}_i} \left| Q_i(p_{i,j_i}) - Q'_i(p_{i,j_i}) \right|. \quad (29)$$

Proposition 2. $\mathcal{H}^{t_e}$ is a contraction mapping operator.

Proof: According to Definition 3, we have Eq. (30).
Next, we discuss the item
$\sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} [\tilde{b}_i(\mathbf{p}_{-(0,i)}) - \tilde{b}'_i(\mathbf{p}_{-(0,i)})] \, u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)})$
in Eq. (30). Due to the fact that the reference points are
exogenously given and of common knowledge, we may have
Eq. (31).






(l>i 



a 



l C (P-(0,i)) 



(24) 



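$$\|\mathcal{H}^{t_e} Q - \mathcal{H}^{t_e} Q'\| = \max_{i \in \mathcal{N}\setminus\{0\}} \max_{p_{i,j_i} \in \mathcal{P}_i} \left| \mathcal{H}^{t_e} Q_i(p_{i,j_i}) - \mathcal{H}^{t_e} Q'_i(p_{i,j_i}) \right|$$
$$= \max_{i \in \mathcal{N}\setminus\{0\}} \max_{p_{i,j_i} \in \mathcal{P}_i} \left| \sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} \left[ \tilde{b}_i(\mathbf{p}_{-(0,i)}) - \tilde{b}'_i(\mathbf{p}_{-(0,i)}) \right] u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)}) \right| \quad (30)$$

$$\sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} \left[ \tilde{b}_i(\mathbf{p}_{-(0,i)}) - \tilde{b}'_i(\mathbf{p}_{-(0,i)}) \right] u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)})$$
$$= -\sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} \delta_i \left[ \pi_i(p_{i,j_i}) - \pi'_i(p_{i,j_i}) \right] u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)}) \quad (31)$$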



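Now, we need to concentrate on the item $\pi_i(p_{i,j_i})$. By
applying Eq. (14), we have

$$\pi_i(p_{i,j_i}) = \frac{\exp(Q_i(p_{i,j_i})/\tau_i)}{\sum_{p \in \mathcal{P}_i} \exp(Q_i(p)/\tau_i)}. \quad (32)$$

When $\tau_i$ is sufficiently large, we have

$$\exp(Q_i(p_{i,j_i})/\tau_i) = 1 + \frac{Q_i(p_{i,j_i})}{\tau_i} + \rho\left(\frac{Q_i(p_{i,j_i})}{\tau_i}\right), \quad (33)$$

where $\rho(Q_i(p_{i,j_i})/\tau_i)$ is a polynomial of the order
$O((Q_i(p_{i,j_i})/\tau_i)^2)$. It's then straightforward to derive

$$\sum_{p \in \mathcal{P}_i} \exp(Q_i(p)/\tau_i) = \sum_{p \in \mathcal{P}_i} \left[ 1 + \frac{Q_i(p)}{\tau_i} + \rho\left(\frac{Q_i(p)}{\tau_i}\right) \right]. \quad (34)$$

It can be easily verified that

$$\pi_i(p_{i,j_i}) = \frac{1}{m_i} + \frac{1}{m_i} \frac{Q_i(p_{i,j_i})}{\tau_i} + g\left(\{Q_i(p)/\tau_i\}_p\right), \quad (35)$$

where $g(\{Q_i(p)/\tau_i\}_p)$ is a polynomial of order smaller than
$O(\{Q_i(p_{i,j_i})/\tau_i\}_p)$. Note that the coefficient of the
polynomial is independent of the Q-value. Similarly, we may
obtain

$$\pi'_i(p_{i,j_i}) = \frac{1}{m_i} + \frac{1}{m_i} \frac{Q'_i(p_{i,j_i})}{\tau_i} + g\left(\{Q'_i(p)/\tau_i\}_p\right). \quad (36)$$

Substituting Eq. (35) and Eq. (36) into Eq. (31) establishes
Eq. (37). This means we can always take a sufficiently large
$\tau_i$ such that Eq. (38) is satisfied, where
$0 < \lambda_i < m_i$. This implies

$$\|\mathcal{H}^{t_e} Q - \mathcal{H}^{t_e} Q'\| \leq \max_{i \in \mathcal{N}\setminus\{0\}} \max_{p_{i,j_i} \in \mathcal{P}_i} \frac{\lambda_i}{m_i} \left| Q_i(p_{i,j_i}) - Q'_i(p_{i,j_i}) \right| \leq \omega \|Q - Q'\|, \quad (39)$$

where $\omega = \max_{i \in \mathcal{N}\setminus\{0\}} \lambda_i / m_i$. It's
obvious that $\omega < 1$.

Therefore, $\mathcal{H}^{t_e}$ is a contraction mapping operator. This
concludes the proof. ■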
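
The large-temperature expansion used in the proof can be checked numerically. The Python sketch below compares the Boltzmann probabilities with a first-order approximation and shows the gap shrinking as the temperature grows. The Q-values are arbitrary illustrative numbers, and the sketch keeps the first-order contribution of the normalising sum explicitly (the `mean_q` term), which is an expansion choice made here rather than the exact grouping of terms in Eq. (35).

```python
import math


def boltzmann(q_values, tau):
    """Exact Boltzmann probabilities for a list of Q-values."""
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def first_order(q_values, tau):
    """First-order expansion in 1/tau around the uniform distribution 1/m."""
    m = len(q_values)
    mean_q = sum(q_values) / m
    return [1.0 / m + (q - mean_q) / (m * tau) for q in q_values]


def max_gap(q_values, tau):
    """Largest absolute difference between exact and approximate probabilities."""
    exact = boltzmann(q_values, tau)
    approx = first_order(q_values, tau)
    return max(abs(e - a) for e, a in zip(exact, approx))


q_example = [0.2, 0.5, 0.9]  # arbitrary illustrative Q-values
gaps = [max_gap(q_example, tau) for tau in (10.0, 100.0, 1000.0)]
```

In this sketch the gap falls by roughly two orders of magnitude for every tenfold increase of the temperature, which is the second-order behaviour that the proof relies on.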



We can now present the main result in this section that the
learning process induced by the RLHPA-II in each episode
converges.

Theorem 4. For each FU $i \in \mathcal{N}\setminus\{0\}$, regardless of any
initial value chosen for $Q_i(p_{i,j_i})$, if the temperature
$\tau_i$ is sufficiently large, the FUs' stochastic behaviors
converge.

Proof: The proof can be completed by directly applying
the Lemma, which establishes the convergence given two
conditions. First, $\mathcal{H}^{t_e}$ is a contraction mapping
operator, by Proposition 2. Second, the fixed point condition,
$Q^* = E[\mathcal{H}^{t_e} Q^*]$, is ensured by Proposition 1.
Therefore, the learning process expressed by Eq. (24)
converges. ■
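
The role of the contraction property can be illustrated with a deterministic toy example. In the Python sketch below, a scalar affine map with slope below one stands in for the operator, and the relaxed iteration reaches the same fixed point from very different initial values. The operator, step size and number of steps are assumptions chosen for illustration only.

```python
def iterate_to_fixed_point(operator, q0, alpha=0.5, steps=200):
    """Relaxed fixed-point iteration q <- q + alpha * (H(q) - q)."""
    q = q0
    for _ in range(steps):
        q += alpha * (operator(q) - q)
    return q


def toy_operator(q, omega=0.5, offset=1.0):
    """Affine contraction with modulus omega < 1; fixed point offset / (1 - omega)."""
    return omega * q + offset


# Three different initial values end at the same fixed point, here 2.0.
limits = [iterate_to_fixed_point(toy_operator, start) for start in (-10.0, 0.0, 25.0)]
```

The independence from the starting value in this sketch mirrors the statement of Theorem 4 that the initial Q-values do not affect convergence.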

VI. Numerical Results 

We provide insight into the performance comparison of
both learning algorithms through numerical simulations.
We consider a representative macrocell scenario where there
are two FUs coexisting with one MU over a spectrum with a
bandwidth of 1 MHz. The minimum SINR targets of the MU and
the FUs are assumed to be 3 dB and 5 dB, respectively. The
measurement noise is zero-mean Gaussian with power
$\sigma^2 = -110$ dBm, and the additional circuit power
consumption is 10 dBm for all users. The femtocells are
uniformly distributed within a circular area centered at the
MBS with a radius of 500 m. The coverage radius of each
femtocell is 20 m. The channel gains are generated by a
log-normal shadowing pathloss model, $h_{ij} = d_{ij}^{-n}$,
where $d_{ij}$ is the distance between user i and BS j, and n
is the pathloss exponent. In the simulations, n is assumed to
be 4.
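
To make the setup concrete, the Python sketch below converts power levels from dBm to watts, applies a distance-based path gain with exponent 4, and evaluates the SINR at one receiver with the noise power taken as -110 dBm. It is a simplified illustration: shadowing is ignored, and the function names, the distances and the single-interferer example are assumptions rather than values from the simulations.

```python
import math


def dbm_to_watt(p_dbm):
    """Convert a power level from dBm to watts."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)


def path_gain(distance_m, exponent=4.0):
    """Distance-based channel gain d^(-n); shadowing is ignored in this sketch."""
    return distance_m ** (-exponent)


def sinr_db(tx_dbm, d_signal_m, interferers, noise_dbm=-110.0):
    """SINR in dB; interferers is a list of (tx_dbm, distance_m) pairs."""
    signal = dbm_to_watt(tx_dbm) * path_gain(d_signal_m)
    interference = sum(dbm_to_watt(p) * path_gain(d) for p, d in interferers)
    return 10.0 * math.log10(signal / (dbm_to_watt(noise_dbm) + interference))


# A user 10 m from its own receiver at 20 dBm, with one 30 dBm interferer
# located 200 m from that receiver (hypothetical geometry).
example = sinr_db(20.0, 10.0, [(30.0, 200.0)])
```

With this geometry the interference term dominates the thermal noise by several orders of magnitude, so the resulting SINR is essentially a signal-to-interference ratio.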

The action set of transmission power levels for all users is
{20, 25, 30} dBm. Each episode contains T = 100 time slots.
For simplicity, we suppose that the belief factors $\delta_i$
are all equal to 2 in RLHPA-II,
$\forall i \in \mathcal{N}\setminus\{0\}$, i.e., the FUs have the
same conjecture ability. Further, we use the following learning
rates for the MU and the FUs,
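(40)

where $\alpha_0, \alpha_i \in [0, 1)$ are the initial learning
rates, and $\theta > 1$ is a scalar and is set to be 1.1 in our
simulations.

$$\sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} \left[ \tilde{b}_i(\mathbf{p}_{-(0,i)}) - \tilde{b}'_i(\mathbf{p}_{-(0,i)}) \right] u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)})$$
$$= -\sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} \delta_i \left[ \frac{1}{m_i} \frac{Q_i(p_{i,j_i}) - Q'_i(p_{i,j_i})}{\tau_i} + g\left(\{Q_i(p)/\tau_i\}_p\right) - g\left(\{Q'_i(p)/\tau_i\}_p\right) \right] u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)}) \quad (37)$$

$$\left| \sum_{\mathbf{p}_{-(0,i)} \in \mathcal{P}_{-(0,i)}} \left[ \tilde{b}_i(\mathbf{p}_{-(0,i)}) - \tilde{b}'_i(\mathbf{p}_{-(0,i)}) \right] u_i(p_{0,j_0}, p_{i,j_i}, \mathbf{p}_{-(0,i)}) \right| \leq \frac{\lambda_i}{m_i} \left| Q_i(p_{i,j_i}) - Q'_i(p_{i,j_i}) \right| \quad (38)$$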

The curves in Fig. 2, Fig. 3 and Fig. 4 show the learning
process of expected utilities for each user in the network. The
results are compared with

1) The fully cooperative power allocation game with
complete information exchange (Case I): each user knows
all the utility functions and transmission power levels of
other users in the network, and then the optimal utilities
in the power allocation process can be obtained by each
user through exhaustive search. This scenario is
equivalent to the classic power control game in femtocell
networks without the pricing schemes from macrocells as
shown in [6].

2) The non-cooperative learning process of the power
control game without any private information exchange
(Case II): each user's transmission decisions in the
learning process are self-incentive with myopic best
response correspondence, which is similar to the scenario
discussed in [11].

The first observation from our simulation results is that, 
whenever we generate random initial probability distributions 
of the power levels, the equilibrium state of the transmission 
strategies achieved by all users is independent of these initial 
values. That is, there exists an SE in the Stackelberg learning 
game, which confirms Theorem 2. 

Secondly, we can see from the curves that the expected
utilities of all users in the learning process finally converge
to (or approach) the equilibrium point of the complete
cooperation case, and these simulation results validate the
conclusions of Theorem 3 and Theorem 4. In addition, both
proposed reinforcement learning based schemes outperform the
non-cooperative case. This is because, in a Stackelberg
learning game, knowing more can improve not only the leader's
(MU's) own utility, but also the utilities of the followers
(FUs). Meanwhile, RLHPA-II achieves better performance than
RLHPA-I, because all FUs have the incentive to achieve better
utilities and thus behave reciprocally by exchanging the
transmission parameters of the previous time slot (as
indicated by Eq. (23)).

Fig. 5 shows the expected SINRs of FUs using RLHPA-I
and RLHPA-II, respectively, versus the macrocell's minimum
QoS requirement $\gamma_0^*$. As expected, a higher $\gamma_0^*$
results in higher interference caused by the MU to the FUs,
i.e., the achieved performances are degraded. Further, it can
be observed that for the same $\gamma_0^*$, the expected SINRs
of the FUs with RLHPA-II are in general higher than those with
RLHPA-I. This is in accordance with our previous discussions.
It is also worth mentioning that when $\gamma_0^*$ is
sufficiently large, the expected SINRs of the FUs approach
zero for the two learning algorithms. This is because when
$\gamma_0^*$ is sufficiently large, there is no femtocell
active in the network.

Fig. 2. Learning process of the expected utilities for FU 1.

Fig. 3. Learning process of the expected utilities for FU 2.

Fig. 4. Learning process of the expected utilities for MU.

Fig. 5. The expected SINRs for FUs versus $\gamma_0^*$.

VII. Conclusion 

In this paper, energy efficiency is investigated for the 
uplink transmission in a spectrum-sharing-based two-tier fem- 
tocell network using stochastic learning theory combined with 
Stackelberg games. The Stackelberg learning framework is 
adopted to jointly study the utility maximization of the MU 
and FUs. Based on reinforcement learning, we propose two 
intelligent algorithms, namely, RLHPA-I and RLHPA-II, whose 
convergence properties have also been proven theoretically. 
Numerical experiments illustrate that the reciprocity-inspired 
RLHPA-II converges more quickly and achieves better utility 
performance compared to RLHPA-I and the non-cooperative 
learning scheme. This comes at the expense of obtaining
more side strategy information at the FUs. In conclusion, both
learning algorithms show potential for improving the energy
efficiency of greener femtocell networks.